Multi-thread processor with multi-bank branch-target buffer

ABSTRACT

A processor includes a pipeline and a multi-bank Branch-Target Buffer (BTB). The pipeline is configured to process program instructions including branch instructions. The multi-bank BTB includes a plurality of BTB banks and is configured to store learned Target Addresses (TAs) of one or more of the branch instructions in the plurality of the BTB banks, to receive from the pipeline simultaneous requests to retrieve respective TAs, and to respond to the requests using the plurality of the BTB banks in the same clock cycle.

FIELD OF THE INVENTION

The present invention relates generally to processor design, and particularly to processors having multi-bank branch-target buffers.

BACKGROUND OF THE INVENTION

Some processors store learned Target Addresses (TAs) of previously-encountered branch instructions in a Branch-Target Buffer (BTB). The BTB is used for efficient processing of subsequent occurrences of such branch instructions.

SUMMARY OF THE INVENTION

An embodiment of the present invention that is described herein provides a processor including a pipeline and a multi-bank Branch-Target Buffer (BTB). The pipeline is configured to process program instructions including branch instructions. The multi-bank BTB includes a plurality of BTB banks and is configured to store learned Target Addresses (TAs) of one or more of the branch instructions in the plurality of the BTB banks, to receive from the pipeline simultaneous requests to retrieve respective TAs, and to respond to the requests using the plurality of the BTB banks in the same clock cycle.

In some embodiments, the pipeline includes multiple hardware threads, and the multi-bank BTB is configured to receive the simultaneous requests from two or more of the hardware threads.

In some embodiments, the multi-bank BTB is configured to (i) detect that two or more of the requests pertain to the same BTB bank, (ii) arbitrate between the requests that access the same BTB bank, so as to select a winning request, (iii) retrieve a TA from the same BTB bank in response to the winning request, and (iv) provide the retrieved TA to a hardware thread that issued the winning request. In an embodiment, the multi-bank BTB is configured to multicast the retrieved TA to the hardware thread that issued the winning request, and to at least one other hardware thread that did not issue the winning request. In an embodiment, the requests that access the same BTB bank pertain to at least two different software threads.

In other embodiments, the multi-bank BTB includes a read buffer, which is configured to store one or more TAs that were previously retrieved from at least one of the BTB banks, and the multi-bank BTB is configured to serve one or more of the requests from the read buffer instead of from the BTB banks. In an embodiment, the multi-bank BTB is configured to attempt serving a request from the read buffer, in response to the request losing arbitration for access to the BTB banks. In another embodiment, the multi-bank BTB is configured to initially attempt serving a request from the read buffer, and, in response to failing to serve the request from the read buffer, to serve the request from at least one of the BTB banks. In yet another embodiment, the multi-bank BTB is configured to attempt serving a request from the read buffer, and, at least partially in parallel, to attempt serving the request from the BTB banks.

In a disclosed embodiment, a given hardware thread is configured to issue a request to the multi-bank BTB only if the given hardware thread is not stalled for more than a predefined number of clock cycles. In an example embodiment, when a given hardware thread is stalled but nevertheless issues a request to the multi-bank BTB, the given hardware thread is configured to store a TA that was returned in response to the request, and to subsequently access the stored TA instead of accessing the BTB banks. In still another embodiment, when a given hardware thread is stalled, the given hardware thread is configured to delay access to the multi-bank BTB.

There is additionally provided, in accordance with an embodiment of the present invention, a method including, in a processor, processing program instructions, including branch instructions, using a pipeline. Learned Target Addresses (TAs) of one or more of the branch instructions are stored in a multi-bank Branch-Target Buffer (BTB) that includes a plurality of BTB banks. Simultaneous requests to retrieve respective TAs are received from the pipeline. The requests are responded to using the plurality of the BTB banks in the same clock cycle.

The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram that schematically illustrates a processor comprising multiple hardware threads and a multi-bank BTB, in accordance with an embodiment of the present invention;

FIG. 2 is a block diagram that schematically illustrates a multi-bank BTB, in accordance with an embodiment of the present invention; and

FIG. 3 is a flow chart that schematically illustrates a method for serving multiple hardware threads by a multi-bank BTB, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Overview

Embodiments of the present invention that are described herein provide improved methods and apparatus for storing and using previously-learned Target Addresses (TAs) of branch instructions in a processor. In particular, disclosed embodiments provide multi-bank Branch-Target Buffers (BTBs) that enable multiple hardware threads to retrieve TAs in the same clock cycle.

In some embodiments, a processor comprises a processing pipeline comprising multiple hardware threads, and a multi-bank BTB comprising a plurality of BTB banks. Each BTB bank typically comprises a single-port memory that supports one read operation per clock cycle. Each BTB bank holds multiple BTB entries, each BTB entry comprising a TA corresponding to a respective Program Counter (PC) value. Each BTB bank is typically assigned a respective sub-range of the possible PC values, e.g., by using a subset of the PC bits as a “bank select” signal. Any suitable subset of bits can be used for this purpose.

The multi-bank BTB is configured to receive from two or more of the hardware threads simultaneous requests to retrieve respective TAs, and to respond to the requests using the plurality of the BTB banks in the same clock cycle. Each request specifies a PC value of a branch instruction for which the TA is requested. As long as the PC values in the requests are mapped to different BTB banks, no collision occurs, and the multi-bank BTB serves each request by accessing the respective BTB bank. If two (or more) requests specify PC values that are mapped to the same BTB bank, arbitration logic in the multi-bank BTB selects which of the requests will be served. The other request or requests are deferred to a later clock cycle.

In some embodiments, the multi-bank BTB employs additional measures for reducing the probability of collision, or for reducing the impact of collision on the requesting hardware threads. For example, when serving a BTB entry to a thread that won the arbitration, the multi-bank BTB may distribute (“multi-cast”) the BTB entry to one or more other threads that lost the arbitration, as well. As another example, in some embodiments the multi-bank BTB comprises an additional read buffer, which caches recently-accessed BTB entries. The multi-bank BTB may serve a request from the read buffer, for example, when the request loses arbitration, or in parallel with attempting to access the BTB banks.

The disclosed multi-bank BTB configurations provide a highly-efficient means of buffering TAs for concurrent use by multiple hardware threads. The disclosed techniques perform well in scenarios that are prone to collision, e.g., when multiple hardware threads process different iterations of the same program loop in parallel. Other use-cases include multiple different programs or software threads that run on the same processor.

System Description

FIG. 1 is a block diagram that schematically illustrates a processor 20, in accordance with an embodiment of the present invention. Processor 20 comprises multiple hardware threads 24 that are configured to operate in parallel. In the present example the processor comprises two hardware threads denoted 24A (also referred to as THREAD#0) and 24B (also referred to as THREAD#1). Generally, however, processor 20 may comprise any suitable number of threads.

In the example of FIG. 1, each thread 24 is configured to process one or more respective segments of the code. Certain aspects of thread parallelization are addressed, for example, in U.S. patent application Ser. Nos. 14/578,516, 14/578,518, 14/583,119, 14/637,418, 14/673,884, 14/673,889, 14/690,424, 14/794,835, 14/924,833, 14/960,385 and 15/196,071, which are all assigned to the assignee of the present patent application and whose disclosures are incorporated herein by reference.

In some embodiments, each hardware thread 24 comprises a fetching module 28, a decoding module 32 and a renaming module 36. Fetching modules 28 fetch the program instructions of their respective code segments from a memory, e.g., from a multi-level instruction cache. In the present example, processor 20 comprises a memory system 41 for storing instructions and data. Memory system 41 comprises a multi-level instruction cache comprising a Level-1 (L1) instruction cache 40 and a Level-2 (L2) cache 42 that cache instructions stored in a memory 43.

In a given thread 24, the fetched instructions are provided from the output of fetching module 28 to decoding module 32. Decoding module 32 decodes the fetched instructions. The decoded instructions provided by decoding module 32 are typically specified in terms of architectural registers of the processor's instruction set architecture. The decoded instructions are provided to renaming module 36, which carries out register renaming.

Processor 20 comprises a register file that comprises multiple physical registers. The renaming modules of the various hardware threads associate each architectural register in the decoded instructions with a respective physical register in the register file (typically allocating new physical registers for destination registers, and mapping operands to existing physical registers).

The renamed instructions (e.g., the micro-ops/instructions output by renaming modules 36) are buffered in-order in one or more Reorder Buffers (ROB) 44, also referred to as Out-of-Order (OOO) buffers. In alternative embodiments, one or more instruction queue buffers are used instead of ROB. The buffered instructions are pending for out-of-order execution by multiple execution units 52, i.e., not in the order in which they have been fetched. In alternative embodiments, the disclosed techniques can also be implemented in a processor that executes the instructions in-order.

The renamed instructions buffered in ROB 44 are scheduled for execution by the various execution units 52. Instruction parallelization is typically achieved by issuing one or multiple (possibly out of order) renamed instructions/micro-ops to the various execution units at the same time. In the present example, execution units 52 comprise two Arithmetic Logic Units (ALU) denoted ALU0 and ALU1, a Multiply-Accumulate (MAC) unit, two Load-Store Units (LSU) denoted LSU0 and LSU1, a Branch execution Unit (BRU) and a Floating-Point Unit (FPU). In alternative embodiments, execution units 52 may comprise any other suitable types of execution units, and/or any other suitable number of execution units of each type. The cascaded structure of threads 24 (including fetch modules 28, decoding modules 32 and renaming modules 36), ROB 44 and execution units 52 is referred to herein as the pipeline of processor 20.

The results produced by execution units 52 are saved in the register file, and/or stored in memory system 41. In some embodiments the memory system comprises a multi-level data cache that mediates between execution units 52 and memory 43. In the present example, the multi-level data cache comprises a Level-1 (L1) data cache 56 and L2 cache 42.

In some embodiments, the Load-Store Units (LSU) of processor 20 store data in memory system 41 when executing store instructions, and retrieve data from memory system 41 when executing load instructions. The data storage and/or retrieval operations may use the data cache (e.g., L1 cache 56 and L2 cache 42) for reducing memory access latency. In some embodiments, high-level cache (e.g., L2 cache) may be implemented, for example, as separate memory areas in the same physical memory, or simply share the same memory without fixed pre-allocation.

A branch/trace prediction module 60 predicts branches or flow-control traces (multiple branches in a single prediction), referred to herein as “traces” for brevity, that are expected to be traversed by the program code during execution by the various threads 24. Based on the predictions, branch/trace prediction module 60 instructs fetching modules 28 which new instructions are to be fetched from memory. As noted above, the instructions being fetched are divided by the control circuitry into groups of instructions referred to as segments, e.g., based on branch or trace prediction. Branch/trace prediction in this context may predict entire traces for segments or for portions of segments, or predict the outcome of individual branch instructions.

In some embodiments, processor 20 comprises a segment management module 64. Module 64 monitors the instructions that are being processed by the pipeline of processor 20, and constructs an invocation data structure, also referred to as an invocation database 68. Typically, segment management module 64 decides how to divide the stream of instructions being fetched into segments, e.g., when to terminate a current segment and start a new segment. In an example non-limiting embodiment, module 64 may identify a program loop or other repetitive region of the code, and define each repetition (e.g., each loop iteration) as a respective segment. Any other suitable form of partitioning into segments, not necessarily related to the repetitiveness of the code, can also be used.

Invocation database 68 divides the program code into traces, and specifies the relationships between them. Module 64 uses invocation database 68 for choosing segments of instructions to be processed, and instructing the pipeline to process them. Database 68 is typically stored in a suitable internal memory of the processor. The structure and usage of database 68 are described in detail in U.S. patent application Ser. No. 15/196,071, cited above.

Since fetching modules 28 fetch instructions according to branch/trace predictions, and according to traversal of invocation database 68, instructions are generally fetched out-of-order, i.e., in an order that differs from the sequential order of appearance of the instructions in the code.

Serving Multiple Hardware Threads Using a Multi-Bank BTB

In some embodiments described herein, processor 20 further comprises a multi-bank Branch-Target Buffer (BTB) 72. BTB 72 stores learned Target Addresses (TAs) of branch instructions. In the examples described herein, the TAs in BTB 72 are accessed by Program Counter (PC) value. Alternatively, the TAs in BTB 72 may be addressed/accessed in any other suitable way, for example by a subset of the PC bits or by Global History Register (GHR) value. Typically, when identifying a branch instruction having a fixed TA (or, in the case of GHR, a flow-control trace leading to a branch instruction), the processor creates a new entry in BTB 72, for later use.

Each entry in BTB 72 comprises a PC value of a branch instruction that was processed in the past by the processor pipeline and found to have a fixed TA, and the TA of that branch instruction. The term “Target Address” refers to the PC value of the next instruction to be processed if the branch is taken.

BTB 72 is typically queried by fetch modules 28. If the BTB has a matching entry, the PC can be set immediately to the next instruction to be fetched, without a need to wait until the branch instruction is decoded.

As will be explained below, the multi-bank structure of BTB 72 allows multiple hardware threads 24 to access the BTB in the same clock cycle with a very small probability of collision. In the present example, BTB 72 serves two hardware threads 24A and 24B. Generally, however, the disclosed techniques can be applied in a similar manner to serve a larger number of hardware threads.

In the present embodiment, BTB 72 can receive two PC values in a given clock cycle: PC0 from thread 24A and PC1 from thread 24B. BTB 72 provides four outputs per clock cycle: HIT0 (a signal indicating whether an entry for a branch instruction at PC0 exists in the BTB or not), TA0 (the TA of the branch instruction at PC0, if such an entry exists), HIT1 (a signal indicating whether an entry for a branch instruction at PC1 exists in the BTB or not), and TA1 (the TA of the branch instruction at PC1, if such an entry exists). In alternative embodiments, BTB 72 may be configured to receive any other suitable number of requests (and thus return any suitable number of TAs) per clock cycle.

FIG. 2 is a block diagram that schematically illustrates the internal structure of multi-bank BTB 72, in accordance with an embodiment of the present invention. In the present example, BTB 72 comprises an array 76 of BTB banks 80 denoted BANK-0, . . . , BANK-N. Typically, both the array and its control are banked, to allow multiple simultaneous accesses. Each BTB bank 80 comprises a suitable memory, typically a single-port Random-Access Memory (RAM) that can be accessed (read or written) once per clock cycle. In alternative embodiments, BTB banks 80 may comprise any other suitable type of memory, e.g., dual-port RAM or Content-Addressable Memory (CAM).

Each BTB bank 80 stores multiple entries, each entry comprising a PC value of a previously-encountered branch instruction, and the learned TA of that branch instruction. Each BTB bank 80 receives a PC value as input, and returns a HIT output (indicating whether an entry for a branch instruction at this PC value exists in the BTB bank or not) and a TA output (the TA of the branch instruction, if such an entry exists).

In an example embodiment, array 76 comprises multiple BTB banks. Each BTB bank holds up to 2K entries, arranged in a set-associative manner with a Least-Recently Used (LRU) replacement policy. This configuration is given purely by way of example. In alternative embodiments, BTB 72 may comprise any other suitable number of BTB banks, of any suitable size and configuration.

Typically, each BTB bank is configured to store the TAs for a respective sub-range of PC values. In an example embodiment, each BTB bank 80 is organized as a Content-Addressable Memory (CAM). In order to look up a PC value in BTB 72, one or more Lower-Significance Bits (LSBs) of the PC value are used for selecting the appropriate BTB bank 80 in array 76, and the remaining bits (or the entire PC value) are used as a key or tag for accessing the selected BTB bank. If the selected BTB bank holds an entry for the PC value in question (the key or tag), the BTB bank returns the corresponding TA, and also asserts the HIT output. If not, the BTB bank de-asserts the HIT output. In other words, the BTB may be configured either as a fully-associative or as a set-associative CAM.
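
By way of a non-limiting illustration, the bank selection and tag lookup described above may be modeled in software as in the following Python sketch. The bank count, the choice of the lowest PC bits as the bank select, and the names bank_select, tag_of, BtbBank and MultiBankBtb are assumptions made for illustration only and do not appear in the embodiments; each bank is modeled as a simple dictionary rather than as a CAM or set-associative memory.

```python
# Hypothetical parameters for illustration; the description fixes no values.
NUM_BANKS = 4                             # assumed to be a power of two
BANK_BITS = NUM_BANKS.bit_length() - 1    # 2 bank-select bits for 4 banks


def bank_select(pc):
    """Use the lowest PC bits as the bank-select signal (an assumption)."""
    return pc & (NUM_BANKS - 1)


def tag_of(pc):
    """Use the remaining PC bits as the key/tag within the selected bank."""
    return pc >> BANK_BITS


class BtbBank:
    """One bank, modeled as a tag -> TA dictionary."""

    def __init__(self):
        self._entries = {}

    def learn(self, tag, ta):
        self._entries[tag] = ta

    def lookup(self, tag):
        """Return (HIT, TA); TA is None when HIT is de-asserted."""
        if tag in self._entries:
            return True, self._entries[tag]
        return False, None


class MultiBankBtb:
    """Array of banks addressed by PC value."""

    def __init__(self):
        self.banks = [BtbBank() for _ in range(NUM_BANKS)]

    def learn(self, pc, ta):
        self.banks[bank_select(pc)].learn(tag_of(pc), ta)

    def lookup(self, pc):
        return self.banks[bank_select(pc)].lookup(tag_of(pc))
```

In this sketch, two PC values that differ in their lowest two bits fall in different banks and can therefore be looked up without contending for the same bank.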

BTB 72 further comprises arbitration logic 84, and select logic 92A and 92B. Arbitration logic 84 arbitrates conflicts in access to BTB banks 80 using any kind of arbitration policy, e.g., Round Robin. Select logic 92A receives TAs that are retrieved from array 76 for thread 24A, outputs them as TA0, and asserts or de-asserts HIT0 as needed. Select logic 92B receives TAs that are retrieved from array 76 for thread 24B, outputs them as TA1, and asserts or de-asserts HIT1 as needed.

In a given clock cycle, BTB 72 may receive up to two queries for target addresses: PC0 from hardware thread 24A, and PC1 from hardware thread 24B. If PC0 and PC1 are mapped to different BTB banks 80, no collision occurs. The BTB bank corresponding to PC0 provides the TA requested by thread 24A (if one exists) to select logic 92A. In parallel, the (different) BTB bank corresponding to PC1 provides the TA requested by thread 24B (if one exists) to select logic 92B.

If, on the other hand, both PC0 and PC1 are mapped to the same BTB bank 80, a collision may occur, since each BTB bank can only be accessed once per clock cycle. In such a case, arbitration logic 84 chooses one of the two queries (either PC0 from thread 24A, or PC1 from thread 24B) to be served in the present clock cycle. The winning hardware thread is granted access to the BTB and is provided with the result. The other hardware thread is not served, and may be stalled until a later clock cycle.

When arbitrating between threads 24A and 24B, arbitration logic 84 may apply any suitable arbitration scheme, such as, for example, a Round-Robin scheme or other fairness-based scheme, or a scheme that gives priority to one hardware thread over the other.
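
As one possible illustration of a Round-Robin scheme, the following Python sketch grants access to the contending thread that follows the most recent winner in cyclic order. The class name RoundRobinArbiter and its interface are hypothetical and are given purely by way of example; arbitration logic 84 may be realized in any other suitable manner.

```python
class RoundRobinArbiter:
    """Hypothetical round-robin arbiter among hardware threads 0..N-1."""

    def __init__(self, num_threads=2):
        self.num_threads = num_threads
        # Start so that thread 0 has first priority.
        self.last_winner = num_threads - 1

    def pick(self, contenders):
        """Return the winning thread among the contending thread indices."""
        for offset in range(1, self.num_threads + 1):
            candidate = (self.last_winner + offset) % self.num_threads
            if candidate in contenders:
                self.last_winner = candidate
                return candidate
        raise ValueError("no contending thread")
```

With two threads that collide on every cycle, this sketch alternates the grant between them, so neither thread is starved.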

In some practical scenarios, for example when hardware threads 24A and 24B process different iterations of the same program loop in parallel, there is high probability of collision between the two threads. Moreover, there is high likelihood that the two threads will not only access the same BTB bank, but will also query the same PC value. Such a scenario may persist for a relatively long time period.

In some embodiments, BTB 72 resolves the above scenario, as well as other possible undesired scenarios, by providing the TA requested by the winning thread to the other threads. In the present two-thread example, if thread 24A wins the arbitration, BTB 72 may provide PC0 and TA0 to both threads 24A and 24B. If thread 24B wins the arbitration, BTB 72 may provide PC1 and TA1 to both threads 24A and 24B. In this manner, if the winning PC value (e.g., PC1 in the latter case) is relevant to the thread that lost the arbitration, the losing thread may continue processing and not stall.
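
The multicast behavior may be sketched as follows, under the assumption (made for illustration) that the multicast entry is relevant to the losing thread exactly when the losing thread requested the same PC value as the winner. The function name multicast_to_loser and its return convention are hypothetical.

```python
def multicast_to_loser(winner_pc, winner_hit, winner_ta, loser_pc):
    """Return (served, hit, ta) as seen by the thread that lost arbitration.

    The winner's PC and result are broadcast; the loser can use them only
    if it asked for the same PC value (an assumption of this sketch).
    """
    if loser_pc == winner_pc:
        return True, winner_hit, winner_ta
    # Different PC in the same bank: the loser is not served this cycle.
    return False, False, None
```

When both threads execute the same loop and query the same branch, the loser in this sketch receives the winner's TA in the same cycle and need not retry.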

In some embodiments, BTB 72 further comprises a read buffer 88, which caches recently-accessed BTB entries. By looking up the read buffer, it is possible to further reduce the likelihood of collision in access to BTB banks 80, and/or resolve collisions as they occur. In an example embodiment, upon successfully accessing a certain entry (PC value and TA) in BTB banks 80, BTB 72 adds the entry to read buffer 88 (assuming there is free space in the buffer). The BTB may evict stale entries from read buffer 88 in accordance with any suitable policy, e.g., Least Recently Used (LRU) or Least Frequently Used (LFU) policy.
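
A small read buffer with LRU eviction may be modeled, for example, as in the Python sketch below. The class name ReadBuffer, the capacity of four entries, and the choice to evict the least-recently-used entry whenever the buffer is full are assumptions of this sketch, not requirements of the embodiments.

```python
from collections import OrderedDict


class ReadBuffer:
    """Hypothetical read buffer caching recently accessed (PC, TA) pairs."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()   # ordered oldest -> most recent

    def lookup(self, pc):
        """Return (hit, ta) and mark the entry as most recently used."""
        if pc in self._entries:
            self._entries.move_to_end(pc)
            return True, self._entries[pc]
        return False, None

    def insert(self, pc, ta):
        """Add an entry fetched from the banks, evicting LRU when full."""
        self._entries[pc] = ta
        self._entries.move_to_end(pc)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # drop least recently used
```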

In one embodiment, when a certain thread loses the arbitration and cannot access BTB banks 80, BTB 72 instead attempts to find the requested TA in read buffer 88. If found, BTB 72 serves the request to the losing thread, i.e., provides the requested TA to the losing thread, from the read buffer instead of from the BTB banks. The requesting hardware thread may thus continue processing and not stall, even though it had lost the arbitration. In another embodiment, BTB 72 looks up read buffer 88 before attempting to access BTB banks 80. Only if the requested TA is not found in the read buffer does BTB 72 proceed to access the BTB banks. In yet another embodiment, BTB 72 looks up read buffer 88 in parallel with, or slightly after, attempting to access BTB banks 80.
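
The first two of the above read-buffer policies may be contrasted using the following Python sketch. The function serve, the policy names "on_loss" and "buffer_first", and the modeling of the read buffer and of a BTB bank as dictionaries keyed by PC value are all hypothetical. The parallel-lookup variant differs mainly in hardware timing and is therefore not modeled here.

```python
def serve(pc, won_arbitration, read_buffer, bank, policy):
    """Return (source, ta) for a single request.

    source is "buffer", "bank" or "deferred"; ta is None on a BTB miss
    or when the request is deferred to a later cycle.
    """
    if policy == "on_loss":
        # Consult the read buffer only after losing arbitration.
        if won_arbitration:
            return "bank", bank.get(pc)
        if pc in read_buffer:
            return "buffer", read_buffer[pc]
        return "deferred", None
    if policy == "buffer_first":
        # Consult the read buffer first; go to the bank only on a miss.
        if pc in read_buffer:
            return "buffer", read_buffer[pc]
        if won_arbitration:
            return "bank", bank.get(pc)
        return "deferred", None
    raise ValueError("unknown policy: " + policy)
```

Under the "buffer_first" policy of this sketch, a request that hits the read buffer does not access the bank at all, which corresponds to the bank-access saving noted in the description.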

In addition to reducing the likelihood of collision, the use of buffer 88 also reduces power consumption, since access to the buffer typically consumes much less power than access to the BTB banks. The use of read buffer 88 is highly advantageous in scenarios characterized by high probability of collision, e.g., when two threads execute the same loop. Nevertheless, read buffer 88 may improve performance in various other scenarios.

In some embodiments, both threads 24A and 24B process the same software thread. In other embodiments, threads 24A and 24B may process different software threads (i.e., different software programs, different code).

In some embodiments, a given hardware thread (24A or 24B) issues requests to BTB 72 only if the hardware thread is not stalled. In other words, in some embodiments stalled hardware threads do not access the BTB. In an embodiment, a thread will refrain from issuing requests to BTB 72 only if stalled for more than a predefined number of consecutive cycles. In other words, a stall of a single cycle (or a predefined small number of cycles) does not necessarily prevent a thread from competing for BTB access.
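
The stall-threshold condition may be expressed, for example, as the following predicate. The function name may_issue_request and the parameter names are hypothetical; setting the threshold to zero reproduces the stricter embodiment in which any stalled thread refrains from accessing the BTB.

```python
def may_issue_request(stalled_cycles, max_stall_cycles):
    """Return True if the thread may still compete for BTB access.

    stalled_cycles is the number of consecutive cycles the thread has
    been stalled (0 when not stalled); the thread refrains only when
    stalled for MORE than max_stall_cycles.
    """
    return stalled_cycles <= max_stall_cycles
```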

In some embodiments, even if a certain hardware thread is stalled, e.g., due to a data cache miss, it may still attempt to access BTB banks 80. In other words, arbitration and BTB access may still be performed for a stalled hardware thread. If a stalled thread wins the arbitration, the thread may store the TA provided by BTB 72. Subsequently, e.g., when no longer stalled, the thread may access the stored TA instead of accessing BTB banks 80. The stalled thread may store the TA in any suitable location, e.g., in a separate buffer (not shown), or in read buffer 88. If the TA is stored in read buffer 88, the thread may store it along with a mark that prevents evicting this TA from the read buffer until the thread in question is no longer stalled and uses the TA. Arbitration logic 84 prevents the stalled hardware thread from participating in arbitration until at least one cycle after the stall is over. Once the stall is over, the thread retrieves the requested TA from its storage location, and therefore has no need to access the BTB banks in that cycle.

In an embodiment, in case of collision in access to the same BTB bank, arbitration logic 84 may give priority to a thread that is not stalled, over a thread that is stalled.

The configurations of processor 20 and of its various components, e.g., BTB 72, shown in FIGS. 1 and 2, are example configurations that are chosen purely for the sake of conceptual clarity. In alternative embodiments, any other suitable configurations can be used. For example, parallelization can be performed in any other suitable manner, or may be omitted altogether. The processor may be implemented without cache or with a different cache structure. The processor may comprise additional elements not shown in the figure. Further alternatively, the disclosed techniques can be carried out with processors having any other suitable micro-architecture. As another example, it is not mandatory that the processor perform register renaming.

As yet another example, the distinction among different hardware threads is not mandatory. In alternative embodiments, the pipeline of processor 20 may have any other suitable structure. BTB 72 may receive and process multiple requests per clock cycle from any elements of the pipeline.

Processor 20 can be implemented using any suitable hardware, such as using one or more Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other device types. Additionally or alternatively, certain elements of processor 20 can be implemented using software, or using a combination of hardware and software elements. The instruction and data cache memories, as well as BTB banks 80 and read buffer 88, can be implemented using any suitable type of memory, such as Random Access Memory (RAM).

Processor 20 may be programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.

FIG. 3 is a flow chart that schematically illustrates a method for serving multiple hardware threads 24 by multi-bank BTB 72, in accordance with an embodiment of the present invention. The method begins with BTB 72 receiving a request from THREAD#0 for the TA corresponding to PC0, at a first request step 100, and a request from THREAD#1 for the TA corresponding to PC1, at a second request step 104.

At a buffer checking step 108, BTB 72 checks whether an entry corresponding to PC0 or to PC1 exists in read buffer 88. If so, no collision occurs. BTB 72 returns TA0 to THREAD#0, and TA1 to THREAD#1, at a first output step 112. The TAs may be served from read buffer 88 and/or from BTB banks 80, depending on where each TA is found.

If, on the other hand, PC0 and PC1 have no matching entries in read buffer 88, BTB 72 checks whether PC0 and PC1 are mapped to the same BTB bank 80 or to different BTB banks, at a bank checking step 116. If PC0 and PC1 are mapped to different BTB banks 80, again no collision occurs. In such a case, at a second output step 120, BTB 72 returns TA0 to THREAD#0 from one BTB bank 80, and returns TA1 to THREAD#1 from a different BTB bank 80.

If step 116 finds that both PC0 and PC1 are mapped to the same BTB bank 80, arbitration logic 84 arbitrates between the two requests, at an arbitration step 124. BTB 72 provides the requested TA to the winning thread from the BTB bank. Optionally, as described above, BTB 72 may multicast the TA to the losing thread, as well. BTB 72 may also update read buffer 88 with the accessed entry (PC and TA).
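
The flow of steps 108, 116, 120 and 124 may be summarized by the following simplified Python model of a single clock cycle. The function names, the four-bank modulo mapping, the representation of each bank and of the read buffer as dictionaries keyed by PC value, and the passing of the arbitration winner as an argument are all assumptions made for illustration. A result of None denotes a request that is not served in the present cycle.

```python
NUM_BANKS = 4   # hypothetical bank count


def bank_of(pc):
    """Assumed mapping of a PC value to a bank index."""
    return pc % NUM_BANKS


def bank_lookup(banks, pc):
    """Return (hit, ta) from the bank that the PC value maps to."""
    ta = banks[bank_of(pc)].get(pc)
    return ta is not None, ta


def serve_cycle(pc0, pc1, banks, read_buffer, winner):
    """Model one cycle; returns [result for THREAD#0, result for THREAD#1]."""
    pcs = (pc0, pc1)
    # Step 108: is either PC value present in the read buffer?
    in_buffer = [pc in read_buffer for pc in pcs]
    if any(in_buffer):
        # Step 112: at most one request reaches the banks, so no collision.
        return [(True, read_buffer[pc]) if hit else bank_lookup(banks, pc)
                for pc, hit in zip(pcs, in_buffer)]
    # Step 116: do the two PC values map to different banks?
    if bank_of(pc0) != bank_of(pc1):
        # Step 120: each request is served from its own bank.
        return [bank_lookup(banks, pc) for pc in pcs]
    # Step 124: same bank, so only the arbitration winner reads it.
    loser = 1 - winner
    results = [None, None]
    results[winner] = bank_lookup(banks, pcs[winner])
    hit, ta = results[winner]
    if hit:
        read_buffer[pcs[winner]] = ta        # optional read-buffer update
    if pcs[loser] == pcs[winner]:
        results[loser] = results[winner]     # optional multicast
    return results
```

In this model, a collision in one cycle populates the read buffer, so that a repeated request for the same PC value in a later cycle is resolved at step 108 without contending for the bank.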

The method flow of FIG. 3 is an example flow that is depicted purely for the sake of clarity. In alternative embodiments, any other suitable method flow can be used. For example, the use of read buffer 88 and/or multicasting of the winning TA may be omitted.

It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.

The invention claimed is:
 1. A processor, comprising: a pipelineconfigured to process program instructions including branchinstructions; and a multi-bank Branch-Target Buffer (BTB), whichcomprises a plurality of BTB banks and is configured to store learnedTarget Addresses (TAs) of one or more of the branch instructions in theplurality of the BTB banks, to receive from the pipeline simultaneousrequests to retrieve respective TAs, and to respond to the requestsusing the plurality of the BTB banks in the same clock cycle, whereinthe multi-bank BTB comprises a read buffer, which is configured to storeone or more TAs that were previously retrieved from at least one of theBTB banks, and wherein the multi-bank BTB is configured to serve one ormore of the requests from the read buffer instead of from the BTB banks.2. The processor according to claim 1, wherein the pipeline comprisesmultiple hardware threads, and wherein the multi-bank BTB is configuredto receive the simultaneous requests from two or more of the hardwarethreads.
 3. The processor according to claim 1, wherein the multi-bankBTB is configured to attempt serving a request from the read buffer, inresponse to the request losing arbitration for access to the BTB banks.4. The processor according to claim 1, wherein the multi-bank BTB isconfigured to initially attempt serving a request from the read buffer,and, in response to failing to serve the request from the read buffer,to serve the request from at least one of the BTB banks.
 5. The processor according to claim 1, wherein the multi-bank BTB is configured to attempt serving a request from the read buffer, and, at least partially in parallel, to attempt serving the request from the BTB banks.
 6. The processor according to claim 1, wherein a given hardware thread is configured to issue a request to the multi-bank BTB only if the given hardware thread is not stalled for more than a predefined number of clock cycles.
 7. The processor according to claim 1, wherein, when a given hardware thread is stalled but nevertheless issues a request to the multi-bank BTB, the given hardware thread is configured to store a TA that was returned in response to the request, and to subsequently access the stored TA instead of accessing the BTB banks.
 8. The processor according to claim 1, wherein, when a given hardware thread is stalled, the given hardware thread is configured to delay access to the multi-bank BTB.
 9. A processor, comprising: a pipeline configured to process program instructions including branch instructions; and a multi-bank Branch-Target Buffer (BTB), which comprises a plurality of BTB banks and is configured to store learned Target Addresses (TAs) of one or more of the branch instructions in the plurality of the BTB banks, to receive from the pipeline simultaneous requests to retrieve respective TAs, and to respond to the requests using the plurality of the BTB banks in the same clock cycle, wherein the multi-bank BTB is configured to: detect that two or more of the requests pertain to the same BTB bank; arbitrate between the requests that access the same BTB bank, so as to select a winning request; retrieve a TA from the same BTB bank in response to the winning request; and provide the retrieved TA to a hardware thread that issued the winning request.
 10. The processor according to claim 9, wherein the multi-bank BTB is configured to multicast the retrieved TA to the hardware thread that issued the winning request, and to at least one other hardware thread that did not issue the winning request.
 11. The processor according to claim 10, wherein the requests that access the same BTB bank pertain to at least two different software threads.
 12. A method, comprising: in a processor, processing program instructions, including branch instructions, using a pipeline; storing learned Target Addresses (TAs) of one or more of the branch instructions in a multi-bank Branch-Target Buffer (BTB) that comprises a plurality of BTB banks; receiving from the pipeline simultaneous requests to retrieve respective TAs; and responding to the requests using the plurality of the BTB banks in the same clock cycle, wherein storing the learned TAs comprises storing, in a read buffer, one or more TAs that were previously retrieved from at least one of the BTB banks, and wherein responding to the requests comprises serving one or more of the requests from the read buffer instead of from the BTB banks.
 13. The method according to claim 12, wherein receiving the simultaneous requests comprises accepting the simultaneous requests from two or more hardware threads of the pipeline.
 14. The method according to claim 12, wherein responding to the requests comprises: detecting that two or more of the requests pertain to the same BTB bank; arbitrating between the requests that access the same BTB bank, so as to select a winning request; retrieving a TA from the same BTB bank in response to the winning request; and providing the retrieved TA to a hardware thread that issued the winning request.
 15. The method according to claim 14, wherein providing the retrieved TA comprises multicasting the retrieved TA to the hardware thread that issued the winning request, and to at least one other hardware thread that did not issue the winning request.
 16. The method according to claim 15, wherein the requests that access the same BTB bank pertain to at least two different software threads.
 17. The method according to claim 12, wherein responding to the requests comprises attempting to serve a request from the read buffer, in response to the request losing arbitration for access to the BTB banks.
 18. The method according to claim 12, wherein responding to the requests comprises initially attempting to serve a request from the read buffer, and, in response to failing to serve the request from the read buffer, serving the request from at least one of the BTB banks.
 19. The method according to claim 12, wherein responding to the requests comprises attempting to serve a request from the read buffer, and, at least partially in parallel, attempting to serve the request from the BTB banks.
 20. The method according to claim 12, and comprising issuing a request to the multi-bank BTB by a given hardware thread only if the given hardware thread is not stalled for more than a predefined number of clock cycles.
 21. The method according to claim 12, and comprising, when a given hardware thread is stalled but nevertheless issues a request to the multi-bank BTB, storing, by the given hardware thread, a TA that was returned in response to the request, and subsequently accessing the stored TA instead of accessing the BTB banks.
 22. The method according to claim 12, and comprising, when a given hardware thread is stalled, delaying access to the multi-bank BTB by the given hardware thread.
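
By way of illustration only, the request handling recited in the claims above (per-bank arbitration, multicast of the retrieved TA, and fallback to a read buffer for requests that lose arbitration) can be sketched as a small behavioral model. The sketch is not the claimed hardware and does not limit the claims. The bank count, the modulo bank-selection rule, the lowest-thread-index arbitration policy, the unbounded read buffer, and all names (MultiBankBTB, bank_of, learn, lookup_cycle) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

NUM_BANKS = 4  # assumed bank count; the claims only require a plurality


def bank_of(pc: int) -> int:
    """Assumed mapping: the branch address modulo the bank count selects the bank."""
    return pc % NUM_BANKS


@dataclass
class MultiBankBTB:
    # One dictionary per bank, mapping branch address -> learned target address (TA).
    banks: List[Dict[int, int]] = field(
        default_factory=lambda: [dict() for _ in range(NUM_BANKS)]
    )
    # Read buffer holding TAs previously retrieved from the banks (unbounded here).
    read_buffer: Dict[int, int] = field(default_factory=dict)

    def learn(self, pc: int, target: int) -> None:
        """Store a learned TA in the bank selected by the branch address."""
        self.banks[bank_of(pc)][pc] = target

    def lookup_cycle(self, requests: Dict[int, int]) -> Dict[int, Optional[int]]:
        """Serve one clock cycle of requests given as {thread_id: branch_pc}.

        Returns {thread_id: TA}, with None for a request that was not served.
        """
        # Group the requesting threads by the bank that each request maps to.
        by_bank: Dict[int, List[int]] = {}
        for tid in sorted(requests):
            by_bank.setdefault(bank_of(requests[tid]), []).append(tid)

        # Losing requests may only see TAs buffered in earlier cycles.
        buffered = dict(self.read_buffer)
        results: Dict[int, Optional[int]] = {}
        for bank, tids in by_bank.items():
            winner = tids[0]  # assumed arbitration policy: lowest thread id wins
            win_pc = requests[winner]
            ta = self.banks[bank].get(win_pc)
            results[winner] = ta
            if ta is not None:
                self.read_buffer[win_pc] = ta  # remember the retrieved TA
            for tid in tids[1:]:
                pc = requests[tid]
                if pc == win_pc:
                    results[tid] = ta  # multicast the winner's TA
                else:
                    results[tid] = buffered.get(pc)  # read-buffer fallback
        return results
```

In this model, requests that map to different banks are all served in the same call, which stands in for a single clock cycle. A thread whose request loses arbitration and misses the read buffer receives None and would be expected to retry in a later cycle.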